Two more fixes for non-UTF-8 tests by aitap · Pull Request #7681 · Rdatatable/data.table

aitap · 2026-03-22T13:43:23Z

Tests 1966.* failed on my Windows 7 VM where I test data.table with old versions of R. ö cannot be represented in CP1251, and enc2native() converted it to a plain unaccented o. If the characters cannot be represented in the ANSI encoding, we might as well skip the tests. (What if it returns NA or ? on a different system?)

Test 1164.1 shouldn't require the characters to be represented in the native encoding, because it only uses UTF-8 and Latin-1. Both match() and chmatch() offer a strong enough guarantee. Tested on the same Windows 7 VM, and also using LC_ALL=zh_CN.gb2312 luit R CMD check (GB2312 doesn't have ä or ß) and LC_ALL=C on GNU/Linux.

Not all ANSI encodings can represent accented Latin characters. For non-representable strings, enc2native() may substitute a different character (observed: "o" instead of "o with an umlaut"), which makes a later comparison of the file name with the original, non-converted string fail.
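
The representability problem described above is not specific to R. As a rough illustration only, the sketch below uses Python's built-in codecs to check whether a character can be encoded in a given legacy encoding. It does not reproduce R's enc2native(): Python's strict encoder raises an error where enc2native() was observed to substitute a plain "o". The helper name `representable` is made up for this example.

```python
def representable(text: str, encoding: str) -> bool:
    """Return True if every character of `text` can be encoded in `encoding`."""
    try:
        # Strict mode (the default) raises instead of substituting characters.
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Encodings mentioned in this thread, plus ASCII as a stand-in for LC_ALL=C.
    for enc in ("cp1251", "gb2312", "latin-1", "ascii"):
        print(enc, representable("\u00f6", enc))  # U+00F6 is "ö"
```

With CPython's bundled codecs, "ö" is rejected by cp1251, gb2312 and ascii and accepted by latin-1. A check of this kind is one way to decide up front whether a test that depends on such characters should be skipped.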

Test 1164.1 should pass with UTF-8 encoded strings without converting them to the native encoding.
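
Why a test that only involves UTF-8 and Latin-1 strings need not depend on the native encoding can be sketched by analogy. The Python snippet below is an illustration under that assumption, not data.table's or R's implementation: the same text has different byte sequences in the two encodings, yet compares equal once each side is interpreted according to its own declared encoding. The function name `same_text` is hypothetical.

```python
# The same character, stored as bytes in two different encodings.
utf8_bytes = "\u00e4".encode("utf-8")      # "ä" as two bytes in UTF-8
latin1_bytes = "\u00e4".encode("latin-1")  # "ä" as one byte in Latin-1


def same_text(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    """Compare two byte strings as text, each decoded with its own encoding."""
    return a.decode(a_enc) == b.decode(b_enc)


if __name__ == "__main__":
    print(utf8_bytes == latin1_bytes)  # raw bytes differ
    print(same_text(utf8_bytes, "utf-8", latin1_bytes, "latin-1"))
```

Nothing in this comparison consults the locale, which is the property the test relies on.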

codecov · 2026-03-22T14:37:45Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.04%. Comparing base (7db13b9) to head (6c0c24e).
⚠️ Report is 8 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #7681   +/-   ##
=======================================
  Coverage   99.04%   99.04%           
=======================================
  Files          87       87           
  Lines       17031    17031           
=======================================
  Hits        16868    16868           
  Misses        163      163

tdhock · 2026-03-30T19:54:23Z

Thanks!
Do you think it would be worth adding a CI job that tests a different locale?
pandas does.

aitap · 2026-04-08T13:21:11Z

Technically, lin-rel-vanilla currently tests with LC_ALL=C (because the container image doesn't have any locales installed). The language of the messages doesn't matter as much for R CMD check because (unlike interactive test.data.table()) it runs with LANGUAGE=en, but the session encoding does.

Testing with LC_CTYPE = fr_CA.ISO-8859-1, zh_CN.gb2312, ru_RU.KOI8-R might help prevent some future bugs, but not the 1966.* failure because that was a Windows-only problem. (I know there are "locale emulators" for Windows, but I haven't used them with R.)

aitap added 2 commits March 22, 2026 16:23

Drop native encoding requirement

6c0c24e

aitap requested a review from MichaelChirico as a code owner March 22, 2026 13:43

tdhock approved these changes Mar 30, 2026

aitap merged commit 76ec862 into master Apr 8, 2026
13 checks passed

aitap deleted the native_file_enc branch April 8, 2026 13:21
